
[import] Separate dataset import logic into separate modules #171

Merged · 11 commits · Sep 14, 2017

Conversation

@rexxars (Member) commented Sep 13, 2017

This PR removes the dataset import logic from @sanity/core and moves it into a separate module called @sanity/import. This makes it much easier to maintain and write tests for, and makes it much more reusable.

In the process, I've rewritten most of the logic to make it much easier to work with and debug, with the tradeoff that it will use more memory and be slightly slower.

I've also made a separate, installable CLI tool, @sanity/import-cli, which handles only the dataset import logic.

As part of the rewrite, a few major improvements have been made:

  • Assets are now hashed by content, not URL. This makes it much more reliable to determine whether an asset needs to be uploaded.
  • Objects within arrays now have keys added to them if not already set, to improve realtime behaviour.
  • References are now assigned the _type field if not already set.
  • "Import maps" are no longer generated as documents, but kept in memory. Less clutter, at the expense of more memory usage.
  • Batching is now much smarter. Batch sizes are determined by JSON payload size instead of number of documents, which should make huge payloads much less likely (see the sketch after this list).
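
To illustrate the batching point, here is a minimal sketch of what size-based batching could look like. The function shape and the 1 MB threshold are assumptions for the example, not the actual @sanity/import implementation:

```js
// Sketch: split documents into batches whose serialized JSON stays below a
// byte threshold, instead of batching by a fixed document count.
const MAX_BATCH_BYTES = 1024 * 1024 // assumed limit, for illustration only

function batchDocuments(documents) {
  const batches = []
  let current = []
  let currentSize = 0

  for (const doc of documents) {
    const size = Buffer.byteLength(JSON.stringify(doc), 'utf8')

    // Start a new batch if adding this document would push us over the limit
    if (current.length > 0 && currentSize + size > MAX_BATCH_BYTES) {
      batches.push(current)
      current = []
      currentSize = 0
    }

    current.push(doc)
    currentSize += size
  }

  if (current.length > 0) {
    batches.push(current)
  }

  return batches
}
```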

There are a couple more things that we should do to improve the import at some point, which I've outlined in the readme, but for the most part this should now work better and more reliably than the old import.

Sorry about the humongous size of the PR.

@bjoerge (Member) left a comment

Just had a minor question, otherwise it looks really great!

// Create batches of documents to import. Try to keep batches below a certain
// byte size (since documents may vary greatly in size depending on type etc.)
const batches = batchDocuments(weakened)

@bjoerge (Member)

It's not clear to me, so just checking: importBatches, uploadAssets and strengthenReferences must be done in sequence, right?

@rexxars (Member, Author)

Sort of. importBatches must be done first, but uploadAssets and strengthenReferences could be done in parallel. The reason we're not doing this at the moment is that if one of these operations fails, there is no way to stop the other. I've made a note of it in the readme as a definite improvement for the future, but it would require using something like a queue instead of simply mapping over items with a concurrency limit.
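
For clarity, here is a rough sketch of the sequencing being described, using the step names from this thread; the signatures and stubs are assumptions for illustration, not the module's actual API:

```js
// Hypothetical stubs standing in for the real implementations
async function importBatches(client, batches) {/* create/patch documents batch by batch */}
async function uploadAssets(client, documents) {/* upload the assets the documents refer to */}
async function strengthenReferences(client, documents) {/* patch weak references to strong */}

async function runImport(client, batches, documents) {
  // Must complete before the other two steps can run
  await importBatches(client, batches)

  // These two could in principle run in parallel:
  //   await Promise.all([uploadAssets(client, documents), strengthenReferences(client, documents)])
  // ...but they are kept sequential for now, since a failure in one
  // currently has no way to cancel the other.
  await uploadAssets(client, documents)
  await strengthenReferences(client, documents)
}
```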

@bjoerge (Member) commented Sep 13, 2017

I see, thanks for the clarification! And +1 for improved error handling in the future. If strengthenReferences fails right now, will it result in dangling assets too, or are the asset uploads idempotent?

@rexxars (Member, Author)

Asset uploads are done first and connected to their corresponding documents. If this operation fails, it will leave dangling assets. When re-running the import, however, it should find the same assets already uploaded and reuse them.

Failures when strengthening references will also leave some references weak, which can easily happen if you are referring to documents that do not exist (incorrect IDs or similar). I've also noted in the readme that we should attempt to check whether referenced documents actually exist before starting the upload.
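
For context, a reference imported as weak versus its strengthened form looks roughly like this; the document ID is made up, and exactly how the patching is done is left to the import module:

```js
// Illustration of the weak/strong distinction being discussed.
// A weak reference does not enforce referential integrity, so the target
// document is allowed to not exist yet (or at all):
const weakRef = {_type: 'reference', _ref: 'some-document-id', _weak: true}

// After strengthening, the _weak flag is dropped and the reference is enforced:
const strongRef = {_type: 'reference', _ref: 'some-document-id'}
```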

All in all, "rolling back" a failed upload is pretty hard.

@bjoerge (Member)

> When re-running the import, however, it should find the same assets already uploaded and reuse them.

This is great. I have no further comments 👩‍⚖️

@rexxars (Member, Author)

> This is great. I have no further comments 👩‍⚖️

Achievement unlocked.
